2016 SEG Machine Learning Contest
The purpose of this notebook is to establish a new approach to evaluating models for this contest. I propose a method which borrows from the K-Folds and Leave-one-out methods, wherein we build the model several times, each time holding out a different well as the validation set.
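In rough pseudocode (a sketch only; the actual implementation, well_folds_cv, is defined later in this notebook), the idea looks like this:

for (well in wells) {
    cv    <- data[data$Well.Name == well, ]   # the held-out well becomes the validation set
    train <- data[data$Well.Name != well, ]   # every other well becomes the training set
    # fit the model on train, score it on cv, and record the metrics
}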
Time to load supporting libraries and the data! We'll also extract the Shankle well from the training data into a blind data set to be used for testing our final model.
In [8]:
# visualization packages
library(repr)
library(ggplot2)
library(ggthemes)
library(cowplot)
# machine learning packages
library(e1071)
library(caret)
In [2]:
# load data
fname <- "../facies_vectors.csv"
data <- read.csv(fname, colClasses=c(rep("factor",3), rep("numeric",6), "factor", "numeric"))
# convert NM_M channel into a binary channel "isMarine"
data$NM_M <- data$NM_M == "2"
names(data)[10] <- "isMarine"
# make the Facies channel more descriptive
levels(data$Facies) <- c("SS", "CSiS", "FSiS", "SiSh", "MS", "WS", "D", "PS", "BS")
# remove any incomplete records (we know from jpoirier001.ipynb PE channel is missing some values)
data <- data[complete.cases(data),]
# split out SHANKLE well as test set into 'blind' data frame
test_index <- data$Well.Name == "SHANKLE"
blind <- data[test_index,]
data <- data[!test_index,]
# display first five rows of data set
head(data)
The contest evaluates models by the overall F1-score on the blind well Shankle; the highest F1-score wins! It therefore makes sense to evaluate our models with the F1-score throughout tuning, training, and testing. Let's devise a function that takes a model's predicted and true values and calculates the overall precision, recall, and F1-score.
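For reference, these are the standard definitions the function below implements, where $TP_i$, $FP_i$, and $FN_i$ are the true positives, false positives, and false negatives for facies $i$, and $n_i$ is its support (the number of true observations of that facies):

$$\mathrm{precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{recall}_i = \frac{TP_i}{TP_i + FN_i}, \qquad F1_i = \frac{2 \cdot \mathrm{precision}_i \cdot \mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i}$$

The overall scores are the support-weighted averages across the nine facies, e.g. overall $F1 = \sum_i n_i F1_i \big/ \sum_i n_i$.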
In [3]:
accuracy_metrics <- function(cm, ytrue) {
    # initialize vectors for precision, recall, and f1 metrics with zeros
    prec <- rep(0, 9)
    recall <- rep(0, 9)
    f1 <- rep(0, 9)

    # loop through facies to compute precision, recall, and f1 for each facies
    beta <- 1
    for (i in 1:9) {
        prec[i] <- cm[i,i] / sum(cm[i,])
        recall[i] <- cm[i,i] / sum(cm[,i])
        f1[i] <- (1 + beta^2) * prec[i] * recall[i] / ((beta^2 * prec[i]) + recall[i])
    }
    prec[is.na(prec)] <- 0
    recall[is.na(recall)] <- 0
    f1[is.na(f1)] <- 0

    support <- as.matrix(table(ytrue))
    tot_precision <- sum(prec * support) / sum(support)
    tot_recall <- sum(recall * support) / sum(support)
    tot_f1 <- sum(f1 * support) / sum(support)
    c(tot_precision, tot_recall, tot_f1)
}

eval_model <- function(ypred, ytrue) {
    cm <- confusionMatrix(ypred, ytrue)
    accuracy_metrics(as.matrix(cm[["table"]]), ytrue)
}
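As a quick sanity check (a toy example with made-up values, not part of the contest workflow), eval_model can be called on any pair of factor vectors that share the facies levels:

# toy check: three of four made-up predictions are correct
ytrue <- factor(c("SS", "CSiS", "CSiS", "FSiS"), levels=levels(data$Facies))
ypred <- factor(c("SS", "CSiS", "FSiS", "FSiS"), levels=levels(data$Facies))
eval_model(ypred, ytrue)   # returns c(overall precision, overall recall, overall F1)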
Now let's apply our well-folds methodology: we loop through the wells, and at each iteration the current well serves as the cross-validation data set while the remaining wells serve as the training data. Our tuning in jpoirier001.ipynb found that (using the randomly split data) the optimal SVM parameters were a Cost of 10 and a Gamma of 1.
Also note that we will not isolate the Recruit F9 "well" because it is not a true well but a manually selected set of observations of the Bafflestone facies.
Let's start by building a function which tunes an SVM (Support Vector Machine) model. It takes two parameters: train, a data frame containing the observations from the training wells, and cv, a data frame containing the observations from the well held out as the cross-validation set. The function tunes the SVM algorithm over a grid of Cost and Gamma parameters and returns a data frame of each parameter pair and its associated performance. Please note, this function may take several (if not tens of) minutes to complete.
In [4]:
# WARNING: this function will take several minutes to complete
tune_svm <- function(train, cv) {
    # tuning parameters
    costs <- c(.01, 5, 10, 15, 20, 25, 30)
    gammas <- c(.0001, .001, .01, .1, 1, 10)

    # initialize performance dataframe
    performances <- data.frame(cost=numeric(), gamma=numeric(),
                               precision=numeric(), recall=numeric(), f1=numeric())

    # loop through each cost and gamma pairing
    for (c in costs) {
        for (g in gammas) {
            set.seed(3124)

            # build model and predict facies
            fit <- svm(Facies ~ ., data=train, kernel='radial', cost=c, gamma=g)
            pred <- predict(fit, newdata=cv)

            # evaluate model and store performance
            metrics <- eval_model(pred, cv$Facies)
            temp <- data.frame(cost=c, gamma=g, precision=metrics[1], recall=metrics[2], f1=metrics[3])
            performances <- rbind(performances, temp)
        }
    }

    # return data frame of parameters and performances
    performances
}
Now let's implement a new cross-validation method which I'm calling Well-Folds. It combines the ideas behind K-folds and leave-one-out: like leave-one-out, we hold out one unit at a time, except here the unit is an entire well of observations; like K-folds, we fold over the whole data set so that every well gets a turn as the hold-out.
Why Well-Folds? In this type of geoscience problem we are interested in developing a predictive model to classify facies for wells which do not have facies data recorded from core. While classic machine learning problems are focused on predicting the outcome of a new observation, this geoscience application is interested in predicting the outcome (facies) for a set of observations from a new well. The Well-Folds method best simulates this.
Let's now define the well-folds function. It takes two parameters. The first, data, is simply the data frame containing the observations for all wells. The second, fxn, is a function which, for each well, is called with the training observations (all other wells) and the cross-validation observations (the held-out well).
In [5]:
well_folds_cv <- function(data, fxn) {
    # list of wells
    wells <- unique(data$Well.Name)
    wells <- wells[-(which(wells == "Recruit F9"))]

    # initialize performances data frame
    performances <- data.frame(cost=numeric(), gamma=numeric(),
                               precision=numeric(), recall=numeric(), f1=numeric(),
                               well=factor())

    # apply fxn to each well fold, record and return performance
    for (well in wells) {
        cvIndex <- data$Well.Name == well
        temp <- fxn(data[!cvIndex, -c(2,3,4)], data[cvIndex, -c(2,3,4)])
        temp$well <- well
        performances <- rbind(performances, temp)
    }

    # return data frame of parameters (incl. well) and their performance
    performances
}
Now that we have our cross-validation (well-folds) and tuning functions (tune_svm) defined, let's apply them to our data set and preview the resulting parameters and performances.
In [9]:
# tune the svm parameters
# WARNING: this task can take several minutes (if not tens of minutes) to complete
tuning <- well_folds_cv(data, tune_svm)
head(tuning)
From the first few rows of the tuning results we can see that model performance (measured by precision, recall, and F1-score) varies depending on the Cost and Gamma values. For each Cost/Gamma pair we have a precision, recall, and F1-score for each well. Now let's average those performance figures across the wells for each Cost/Gamma pair and output the parameters which maximize the average F1-score.
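As an aside, the same per-pair averaging could be done in a single line with base R's aggregate(); the explicit loop below is kept because it mirrors the structure of the tuning grid:

# equivalent aggregation (sketch): mean precision/recall/f1 for each cost-gamma pair
aggregate(cbind(precision, recall, f1) ~ cost + gamma, data=tuning, FUN=mean)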
In [10]:
# extract the unique cost and gamma values used
costs <- unique(tuning$cost)
gammas <- unique(tuning$gamma)

# initialize dataframe for statistics
df <- data.frame(cost=numeric(),
                 gamma=numeric(),
                 mean_precision=numeric(),
                 mean_recall=numeric(),
                 mean_f1=numeric())

# loop through costs and gammas vectors, calculate performance stats for each pair
for (cost in costs) {
    for (gamma in gammas) {
        # retrieve rows for current cost and gamma values
        temp <- tuning[tuning$cost == cost & tuning$gamma == gamma,]

        # calculate mean precision, recall, and f1 across wells
        mean_precision <- mean(temp$precision)
        mean_recall <- mean(temp$recall)
        mean_f1 <- mean(temp$f1)

        # add the calculated stats to dataframe
        df <- rbind(df, data.frame(cost=cost,
                                   gamma=gamma,
                                   mean_precision=mean_precision,
                                   mean_recall=mean_recall,
                                   mean_f1=mean_f1))
    }
}

# identify the parameter pair with the maximum average F1-score
dfmax <- df[which.max(df$mean_f1),]
round(dfmax, 2)
This is interesting. In jpoirier001.ipynb our tuning identified a Cost of 10 and a Gamma of 1 as optimal. Here we've found that a Cost of 15 and a Gamma of 0.01 performs best. To paint a more visual portrait of how the Cost and Gamma parameters influence model performance, let's graph the average precision, recall, and F1-score for each Cost/Gamma pair. Maximizing the average F1-score gives us the ideal parameters.
In [11]:
options(repr.plot.width=8, repr.plot.height=3)

# mean precision across wells
g1 <- ggplot(df, aes(cost, gamma, fill=mean_precision)) + theme_economist_white(gray_bg=T) +
    geom_raster(alpha=.8) +
    geom_point(data=dfmax, aes(cost, gamma)) +
    geom_text(aes(x=dfmax$cost[1]-4, y=dfmax$gamma[1],
                  label=paste("Maximum mean F1-Score =", round(dfmax$mean_f1,2),
                              "\n Cost=", dfmax$cost[1], "\n Gamma=", dfmax$gamma[1])),
              size=2) +
    scale_y_log10() +
    labs(x="Cost", y="Gamma", title="Mean Precision") +
    scale_fill_distiller(palette="Spectral", name="", direction=-1) +
    theme(legend.position='none',
          plot.title=element_text(size=8),
          axis.text=element_text(size=8),
          axis.title=element_text(size=8),
          legend.text=element_text(size=8))

# mean recall across wells
g2 <- ggplot(df, aes(cost, gamma, fill=mean_recall)) + theme_economist_white(gray_bg=T) +
    geom_raster(alpha=.8) +
    geom_point(data=dfmax, aes(cost, gamma)) +
    geom_text(aes(x=dfmax$cost[1]-4, y=dfmax$gamma[1],
                  label=paste("Maximum mean F1-Score =", round(dfmax$mean_f1,2),
                              "\n Cost=", dfmax$cost[1], "\n Gamma=", dfmax$gamma[1])),
              size=2) +
    scale_y_log10() +
    labs(x="Cost", y="Gamma", title="Mean Recall") +
    scale_fill_distiller(palette="Spectral", name="", direction=-1) +
    theme(legend.position='none',
          plot.title=element_text(size=8),
          axis.text=element_text(size=8),
          axis.title=element_text(size=8),
          legend.text=element_text(size=8))

# mean f1-score across wells
g3 <- ggplot(df, aes(cost, gamma, fill=mean_f1)) + theme_economist_white(gray_bg=T) +
    geom_raster(alpha=.8) +
    geom_point(data=dfmax, aes(cost, gamma)) +
    geom_text(aes(x=dfmax$cost[1]-4, y=dfmax$gamma[1],
                  label=paste("Maximum mean F1-Score =", round(dfmax$mean_f1,2),
                              "\n Cost=", dfmax$cost[1], "\n Gamma=", dfmax$gamma[1])),
              size=2) +
    scale_y_log10() +
    labs(x="Cost", y="Gamma", title="Mean F1-Score") +
    scale_fill_distiller(palette="Spectral", name="", direction=-1) +
    theme(legend.position='none',
          plot.title=element_text(size=8),
          axis.text=element_text(size=8),
          axis.title=element_text(size=8),
          legend.text=element_text(size=8))
# bring the three plots together and display
g <- plot_grid(g1, g2, g3, ncol=3)
ggdraw() +
    draw_plot(g, width=1, height=1, y=-.01) +
    draw_plot_label("SVM Tuning", size=10)
These raster plots show that our chosen Cost and Gamma of 15 and 0.01 respectively maximize the overall F1-score (0.49), and appear to maximize the overall precision and recall as well.
Let's now begin training our model using the optimized tuning parameters. We'll use our Well-folds algorithm to average our metrics, gaining an idea of how the model is performing. But first, we need to define our training function.
In [25]:
# WARNING: this function will take several minutes to complete
train_svm <- function(train, cv) {
    set.seed(3124)

    # build model and predict facies
    fit <- svm(Facies ~ ., data=train, kernel='radial', cost=15, gamma=.01)
    pred <- predict(fit, newdata=cv)

    # evaluate model and store performance
    metrics <- eval_model(pred, cv$Facies)
    data.frame(cost=15, gamma=.01, precision=metrics[1], recall=metrics[2], f1=metrics[3])
}
Our training function (above) takes in a training set and a cross-validation set. It builds an SVM model from the training set using the tuned parameters (Cost of 15, Gamma of 0.01) and applies it to the cross-validation set, returning metrics on the model's performance. Let's apply it to our data!
In [26]:
# apply well-folds to train model
# WARNING: this task can take a couple of minutes to complete
train_performance <- well_folds_cv(data, train_svm)
train_performance_round <- train_performance
train_performance_round[,c(1,2,3,4,5)] <- round(train_performance[,c(1,2,3,4,5)],2)
train_performance_round
Interesting! Our model seems to be performing in the 0.5 to 0.6 F1-score range, with the exception of the Cross H Cattle well (we'll worry about that in another notebook). Let's average these out for an estimated model performance.
In [27]:
round(apply(train_performance[, c(3,4,5)], 2, mean),2)
This suggests our training model has an average F1-score of 0.49. Now let's apply the model to the blind data set!
In [28]:
set.seed(3124)
# build model and predict facies
fit <- svm(Facies ~ ., data=data[,-c(2,3,4)], kernel='radial', cost=15, gamma=.01)
pred <- predict(fit, newdata=blind)
# evaluate model
cm <- confusionMatrix(pred, blind$Facies)
metrics <- round(accuracy_metrics(as.matrix(cm[["table"]]), blind$Facies),2)
print(paste("Overall Precision:", metrics[1]))
print(paste("Overall Recall:", metrics[2]))
print(paste("Overall F1-score:", metrics[3]))